Clustering multilingual documents by estimating text-to-text semantic relatedness
Abstract
This thesis addresses multilingual document clustering by estimating the semantic relatedness between multilingual texts. Specifically, we focus on clustering multilingual documents with very limited or no supervisory information. We present two approaches to the problem: a comparable-corpora-based approach and a web-search-based approach. Our first approach derives pairwise constraints from comparable corpora and clusters multilingual documents in a semi-supervised manner. The method models a document collection as a weighted graph, and supervisory information is given as sets of must-link constraints between documents in different languages. Recursive k-nearest-neighbor similarity propagation is used to exploit this prior knowledge and estimate the semantic relatedness of multilingual documents. A spectral method is then applied to find the best cuts of the graph. Our second approach uses web searches to estimate the semantic relatedness between multilingual words. In this approach, we extract informative terms from each language involved and query a search engine with each pair of extracted terms as keywords to construct a web-count-based word similarity matrix. A variant of the hierarchical agglomerative clustering algorithm is then applied to discover multilingual word clusters, and the resulting word clusters are used as features for document clustering. Evaluation with several measures suggests that the proposed algorithms achieve satisfactory clustering results and outperform existing methods that use similar supervisory information. Furthermore, since we do not use any language-dependent information in the clustering process, our algorithms can be applied to documents written in different writing systems, such as Japanese and English texts.
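The second approach can be illustrated with a minimal Python sketch. The abstract does not give the similarity formula or the clustering variant, so everything specific here is an assumption: the Jaccard-style ratio over hit counts, the plain average-link merging rule, the `threshold` value, and the toy hit counts (which stand in for real search-engine results and are not data from the thesis). All function names are hypothetical.

```python
from itertools import combinations


def jaccard_from_counts(hits_a, hits_b, hits_ab):
    """Assumed web-count similarity: joint hits divided by the union of hits."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union > 0 else 0.0


def average_link_clusters(terms, sim, threshold):
    """Plain average-link agglomerative clustering (not the thesis's variant).

    Repeatedly merges the two most similar clusters and stops once the best
    remaining pair falls below `threshold`.
    """
    clusters = [[t] for t in terms]

    def link(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def cluster_features(tokens, clusters):
    """Represent a document by how many of its tokens fall in each word cluster."""
    return [sum(tokens.count(t) for t in c) for c in clusters]


# Toy hit counts, invented for illustration only.
single_hits = {"bank": 1000, "銀行": 800, "soccer": 1200, "サッカー": 900}
pair_hits = {
    frozenset(("bank", "銀行")): 600,
    frozenset(("soccer", "サッカー")): 700,
    frozenset(("bank", "soccer")): 20,
    frozenset(("bank", "サッカー")): 10,
    frozenset(("銀行", "soccer")): 15,
    frozenset(("銀行", "サッカー")): 12,
}

# Build the word similarity "matrix" as a dict keyed by unordered term pairs.
sim = {}
for pair, both in pair_hits.items():
    a, b = tuple(pair)
    sim[pair] = jaccard_from_counts(single_hits[a], single_hits[b], both)

terms = list(single_hits)
word_clusters = average_link_clusters(terms, sim, threshold=0.2)
doc_vector = cluster_features(["bank", "bank", "サッカー"], word_clusters)
```

With these toy counts, the English and Japanese terms for the same concept end up in the same word cluster, and a document is then described by its per-cluster term counts regardless of the language it is written in.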
Similar resources
Semantic smoothing for text clustering
In this paper, we present a new semantic smoothing vector space kernel (S-VSM) for text document clustering. In the suggested approach, semantic relatedness between words is used to smooth the similarity and the representation of text documents. The basic hypothesis examined is that considering semantic relatedness between two text documents may improve the performance of the text document clust...
Semantic-Based Multilingual Document Clustering via Tensor Modeling
A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue we propose a ne...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Discovering Parallel Text from the World Wide Web
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...